
Chunk Data Model supports per-chunk service event mapping #6744

Merged
merged 33 commits into feature/efm-recovery from jord/6622-chunk-service-events on Dec 11, 2024

Conversation

@jordanschalm (Member) commented Nov 20, 2024

This PR adds support for specifying which service events were emitted in which chunk, by modifying the ChunkBody data structure in a backward-compatible manner. Addresses #6622.

Changes

  • Adds ServiceEventCount field to ChunkBody:
    • This field creates an explicit mapping, committed to by Execution Nodes, of service events to chunks. This allows Verification Nodes to know which service events to expect when validating any chunk.
    • This field is defined to be backward-compatible with prior data model versions:
      • Existing serializations of ChunkBody will unmarshal into a struct with a nil ServiceEventCount. We define a chunk with nil ServiceEventCount to have the same semantics as before the field existed: if any service events were emitted, then they were emitted from the system chunk.
      • Post software upgrade, all new (honest) serializations of ChunkBody will always have a non-nil ServiceEventCount (a minimal sketch of both interpretations follows below).
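To make the two representations concrete, below is a minimal sketch of how a consumer might attribute service events to chunks under both conventions. All type and helper names here are illustrative assumptions, not the actual flow-go API:

```go
package main

import "fmt"

// Sketch only: illustrative type, not the actual flow-go ChunkBody.
type ChunkBody struct {
	// nil     => v0 semantics: any service events belong to the system chunk.
	// non-nil => number of service events emitted by this chunk.
	ServiceEventCount *uint16
}

// serviceEventsPerChunk returns, for each chunk, how many service events it emitted.
func serviceEventsPerChunk(chunks []ChunkBody, totalServiceEvents int) []int {
	counts := make([]int, len(chunks))
	if len(chunks) > 0 && chunks[0].ServiceEventCount == nil {
		// v0 semantics: attribute all service events to the system (last) chunk.
		counts[len(chunks)-1] = totalServiceEvents
		return counts
	}
	for i, c := range chunks {
		counts[i] = int(*c.ServiceEventCount)
	}
	return counts
}

func main() {
	one, zero := uint16(1), uint16(0)
	fmt.Println(serviceEventsPerChunk([]ChunkBody{{nil}, {nil}}, 2))    // [0 2] (v0 semantics)
	fmt.Println(serviceEventsPerChunk([]ChunkBody{{&one}, {&zero}}, 1)) // [1 0] (v1 semantics)
}
```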

Upgrade Notes

Source of truth for upgrade plans (still WIP): https://flowfoundation.notion.site/EFM-Recovery-Release-Upgrade-Plan-WIP-14d1aee1232480228a87e43933815285?pvs=4

Note: Implementation changes associated with the upgrade process will be implemented separately, when the upgrade process is fully specified (see #6777).

#6783 captures changes required around the upgrade behaviour.

To Do Before Merging

  • Ensure consistent hashing between ExecutionResult versions
  • Update description in Remove ChunkBody backward-compatibility #6773
  • Check if necessary to update RPC model (rpc conversion tests fail with non-nil ServiceEventCount field)
  • Should this directly target master / the current spork branch, as this HCU will occur before EFM Recovery? (See upgrade plan)

This PR replaces two prior approaches, partially implemented in #6629 and #6730.

@codecov-commenter commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 70.75472% with 31 lines in your changes missing coverage. Please review.

Project coverage is 41.72%. Comparing base (7c71c41) to head (9a10320).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| utils/slices/slices.go | 0.00% | 9 Missing ⚠️ |
| utils/unittest/encoding.go | 0.00% | 7 Missing ⚠️ |
| engine/execution/block_result.go | 44.44% | 4 Missing and 1 partial ⚠️ |
| utils/unittest/fixtures.go | 0.00% | 4 Missing ⚠️ |
| model/flow/chunk.go | 94.23% | 2 Missing and 1 partial ⚠️ |
| module/chunks/chunkVerifier.go | 57.14% | 2 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@                  Coverage Diff                  @@
##           feature/efm-recovery    #6744   +/-   ##
=====================================================
  Coverage                 41.71%   41.72%
=====================================================
  Files                      2030     2031    +1
  Lines                    180459   180552   +93
=====================================================
+ Hits                      75285    75329   +44
- Misses                    98978    99031   +53
+ Partials                   6196     6192    -4
```
| Flag | Coverage Δ |
|---|---|
| unittests | 41.72% <70.75%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.


@jordanschalm changed the title from "DRAFT: Chunk Data Model supports service event indices" to "Chunk Data Model supports service event indices" on Nov 21, 2024
@AlexHentschel (Member) commented:
strategic / conceptual thoughts

  1. I am thinking about enforcing activation of the protocol extension (specifically, the usage of the Chunk.ServiceEventIndices field). For example, it would be nice if consensus nodes could drop Execution Results with the deprecated format, avoid incorporating them into a block, and reject blocks that still do.

  2. To ensure consensus incorporates only execution receipts following the new convention after a certain height, it would be great if we could also include some consistency check in the receiptValidator (somewhere around here).

  3. I was wondering if your plan is still to remove the Chunk.ServiceEventIndices field for Byzantine Fault Tolerance in the long term? I think we had talked about turning ExecutionResult.ServiceEventList into an indexed list. Or are you thinking about keeping the ServiceEventIndices field as the long-term solution, just with the backwards-compatibility case removed?

    In general, my preference would be to also allow Chunk.ServiceEventIndices = nil when there are no service events generated in the chunk. Thereby the final solution becomes a lot more intuitive:

    • if ExecutionResult.ServiceEvents is not empty (nil allowed), then the new convention requires for consistency:

      $$\sum_{\texttt{Chunk}} \texttt{len}(\texttt{Chunk.ServiceEventIndices}) = \texttt{len}(\texttt{ExecutionResult.ServiceEventList})\qquad\qquad\qquad\qquad(1)$$

      I feel this check is very similar to other properties that consensus nodes already verify about an Execution Receipt (see the ReceiptValidator implementation).

    • We would temporarily relax the new convention, eq. $(1)$, for backwards compatibility as follows: the ServiceEventIndices fields of all chunks can be nil despite there being service events.

    • As service events are rare, ExecutionResult.ServiceEventList is empty in the majority of cases. Then both the deprecated and the new convention would allow ChunkBody.ServiceEventIndices to be nil (which is the most intuitive convention anyway). Also, for individual chunks that don't produce any service events, their ChunkBody.ServiceEventIndices could be nil or empty according to the new convention. So the new convention is very self-consistent in my opinion, and the deprecation condition is only an add-on that can be removed later.

That said, if this is only a temporary solution, I am happy and we can skip most of question 3.

@jordanschalm (Member, Author) commented Nov 25, 2024

Summary of Discussion with @AlexHentschel

Change of ServiceEventIndices structure

  • Replaces the list with ServiceEventsNumber, a uint16 counting the number of service events emitted in that chunk:
    • Since VNs have access to all chunks, they can easily compute the index range based on this field alone
    • This is a more compact, and fixed-size representation
    • Structural validation is simpler: we require only that sum(chunk.ServiceEventsNumber for chunk in chunks) == len(ServiceEvents) (a sketch of this check follows after this list)
  • Backward compatibility:
    • If any service events are emitted in a result, and all ServiceEventsNumber fields are 0, then this is interpreted as a v0 model version: all service events must have been emitted in the system chunk.
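A minimal sketch of this structural validation, including the v0 fallback described above. The type and function names are illustrative, not the actual flow-go ReceiptValidator code:

```go
package main

import (
	"errors"
	"fmt"
)

// Sketch only: illustrative type, not the actual flow-go chunk model.
type Chunk struct {
	ServiceEventsNumber uint16 // count of service events emitted in this chunk
}

// validateServiceEventCounts checks that the per-chunk counts are consistent
// with the result-level service event list. All-zero counts alongside a
// non-empty event list are interpreted as the v0 model (all events emitted
// in the system chunk), per the backward-compatibility rule above.
func validateServiceEventCounts(chunks []Chunk, numServiceEvents int) error {
	sum := 0
	for _, c := range chunks {
		sum += int(c.ServiceEventsNumber)
	}
	if sum == numServiceEvents {
		return nil // new convention satisfied
	}
	if sum == 0 && numServiceEvents > 0 {
		return nil // v0 semantics: all events attributed to the system chunk
	}
	return errors.New("chunk service event counts inconsistent with execution result")
}

func main() {
	fmt.Println(validateServiceEventCounts([]Chunk{{0}, {2}}, 2)) // <nil>
	fmt.Println(validateServiceEventCounts([]Chunk{{0}, {0}}, 2)) // <nil> (v0 fallback)
	fmt.Println(validateServiceEventCounts([]Chunk{{1}, {0}}, 2)) // error
}
```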

Removing backward-compatibility at next spork

We plan to keep the overall structure as the long term solution, and only remove the backward-compatibility support at the next spork. We do not plan to use a different representation (ie. Chunk.ServiceEventIndices field).

Upgrade Comments

  • Protocol HCUs are triggered by a ProtocolVersionUpgrade service event. This service event is emitted outside the system chunk, meaning that the first Protocol HCU must take place after the service event validation fix has been deployed.
  • We plan to incorporate all changes under feature/efm-recovery in one Protocol HCU.

Rough Outline of Process

  1. Do a manual rolling upgrade of all node roles, to a version including feature/efm-recovery.
  2. Manually verify that all ENs are upgraded (this is the only correctness-critical step!)
    • this is necessary prior to emitting the first service event outside the system chunk
    • ideally, VNs are also updated, but we can rely on emergency sealing as a fallback if necessary
  3. Emit ProtocolVersionUpgrade service event, scheduling the protocol upgrade at view V.
    • Nodes which are not upgraded when we enter view V will halt.
    • Before entering view V, we must have a substantial majority (>> supermajority) of SNs and LNs upgraded for the network to remain live (this is the only liveness-critical step!)

@jordanschalm changed the title from "Chunk Data Model supports service event indices" to "Chunk Data Model supports per-chunk service event mapping" on Dec 2, 2024
@jordanschalm (Member, Author) commented:
Summary of discussion with @zhangchiqing

Let's call v0 the old version and v1 the new version. Leo pointed out that v0 nodes will fail to validate data models produced by v1 nodes. In particular, this step:

> Do a manual rolling upgrade of all node roles, to a version including feature/efm-recovery.

will not work if v1 nodes immediately begin producing v1 chunks with a non-nil ServiceEventsNumber field.

This means the upgrade needs to be split into multiple steps, because v0 software will be unable to read v1 data models with non-nil new fields (they will produce different hashes). Instead, we need:

  1. Rolling upgrade from software version v0 to v1 (v1 still produces chunks with nil field)
  2. Protocol HCU, after which v0 nodes cannot progress. At this point, v1 nodes begin producing v1 chunk models with non-nil fields. Chunk models with nil fields are also still accepted (due to sealing lag).
  3. Protocol HCU, after which v0 data models are no longer accepted in new blocks

Because of the above additional complexity, I think we should revert the ProtocolStateVersionUpgrade service event to be emitted in the system chunk (at least to start), because we need Protocol HCUs to safely roll out the breaking chunk model change. After the upgrade is complete, all subsequent service events may be emitted outside the system chunk.

@zhangchiqing (Member) commented:
I think we can still do it with just one HCU, but we might need 2 rolling upgrades. These are the steps; basically, Steps 1 and 3 are rolling upgrades and Step 2 is a protocol HCU:

  1. Rolling upgrade from software version v0 to v1 (v1 still produces chunks with the nil field), making sure that:
  • a v1 EN will produce v1 results that have chunks with the nil field, so that v1 results can be accepted by both v0 and v1 SNs.
  • a v1 result with the nil field in its chunks produces the same result ID as the v0 result decoded from the v1 result.
  • a v1 block with a v1 result that has the nil field also produces the same block ID as the v0 block with the decoded v0 result, so that v1 blocks can be accepted by both v0 and v1 nodes.
  2. Protocol HCU, after which results with the nil field will not be accepted. At this point, v1 ENs begin producing v1 chunk models with non-nil fields for blocks above the HCU height. Details, once the protocol HCU event is emitted (see the sketch after this list):
  • When a v1 SN receives results, it will reject results with the nil field for blocks above the HCU height.
  • When a v1 SN builds blocks for heights above the HCU height, it will include only v1 results with non-nil fields. Note that, due to sealing lag, a v1 SN might build some blocks that still contain results with the nil field, because those are for blocks below the HCU height.
  • When any v1 node receives a block, it will reject any block that contains a result above the HCU height with the nil field in its chunks.
  • Blocks produced by a v0 SN for heights above the HCU won't be accepted by v1 SNs.
  • A v0 SN cannot progress after the HCU because it won't accept blocks from v1 SNs (the block hash doesn't match).
  3. Rolling upgrade, after which v0 results, and v1 results with the nil field, will not be accepted in new blocks.
  • Once the HCU height from step 2 has been sealed and all ENs are on v1, we can roll out this upgrade, because we are sure that no results in new blocks will have the nil field in their chunks.
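A rough sketch of the height-gated acceptance rule from step 2. The function and parameter names are hypothetical; the real check would live in flow-go's receipt validation, which this does not reproduce:

```go
package main

import "fmt"

// acceptChunkFormat sketches step 2's rule: results for blocks above the HCU
// height must use the v1 chunk format (non-nil ServiceEventCount), while
// results for blocks at or below it may still use the v0 format (sealing lag).
func acceptChunkFormat(serviceEventCount *uint16, blockHeight, hcuHeight uint64) error {
	if blockHeight > hcuHeight && serviceEventCount == nil {
		return fmt.Errorf("v0 chunk format rejected at height %d (HCU at %d)", blockHeight, hcuHeight)
	}
	return nil
}

func main() {
	count := uint16(1)
	fmt.Println(acceptChunkFormat(&count, 101, 100)) // <nil>: v1 format accepted
	fmt.Println(acceptChunkFormat(nil, 101, 100))    // error: v0 format rejected
	fmt.Println(acceptChunkFormat(nil, 99, 100))     // <nil>: v0 still allowed below HCU
}
```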

@jordanschalm (Member, Author) commented:
@zhangchiqing I agree that would work. Though, I suspect we will prefer fewer software upgrades, so that we depend less on actions by node operators; still, it's good to know that we have flexibility.

I have written a version of the upgrade plan based on our last discussion (with 2 HCUs and 1 rolling upgrade) here. The upgrade process is fairly simple; there is also an enumeration of versions, which I hope can be generalized beyond this specific example.

@AlexHentschel (Member) left a comment:
Looks great. I struggled with understanding the Chunk Verifier tests. I think we can fix this with some documentation. They are tests; it's fine if the documentation is repetitive.

model/flow/chunk.go (review thread resolved)
```go
// (2) Otherwise, ServiceEventCount must be non-nil.
// Within an ExecutionResult, all chunks must use either representation (1) or (2), not both.
ServiceEventCount *uint16
BlockID           Identifier // Block ID of the execution result this chunk belongs to
```
Member:
Do we maybe want to move this to the beginning of the ChunkBody? I think conceptually that would be more consistent.

Member Author:

I agree, but RLP encoding depends on field ordering within structs, so doing this would change the ID computation (unless we overrode it again, using the RLP encoding).
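As a small illustration of this field-ordering sensitivity, here is a sketch using go-ethereum's rlp package as a stand-in encoder (flow-go's actual ID computation differs in detail; the struct names are made up):

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/rlp"
)

// Two structs with identical fields declared in different order.
type FieldsAB struct {
	A uint64
	B string
}
type FieldsBA struct {
	B string
	A uint64
}

func main() {
	ab, _ := rlp.EncodeToBytes(FieldsAB{A: 1, B: "chunk"})
	ba, _ := rlp.EncodeToBytes(FieldsBA{B: "chunk", A: 1})
	// RLP encodes struct fields in declaration order, so the bytes (and hence
	// any hash/ID derived from them) differ even though the data is identical.
	fmt.Printf("%x\n%x\n", ab, ba)
}
```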

Member Author:

Added a test case to validate this: a61e30e

model/flow/chunk.go (additional review threads resolved)
module/chunks/chunkVerifier_test.go, comment on lines 292 to 335
```go
// Tests the case where a service event is emitted outside the system chunk
// and the event computed by the VN does not match the Result.
// NOTE: this test case relies on the ordering of transactions in generateCollection.
func (s *ChunkVerifierTestSuite) TestServiceEventsMismatch_NonSystemChunk() {
	script := "service event mismatch in non-system chunk"
	meta := s.GetTestSetup(s.T(), script, false, true)
	vch := meta.RefreshChunkData(s.T())

	// modify the list of service events produced by FVM
	// EpochSetup event is expected, but we emit EpochCommit here resulting in a chunk fault
	epochCommitServiceEvent, err := convert.ServiceEvent(testChain, epochCommitEvent)
	require.NoError(s.T(), err)

	s.snapshots[script] = &snapshot.ExecutionSnapshot{}
	s.outputs[script] = fvm.ProcedureOutput{
		ComputationUsed:        computationUsed,
		ConvertedServiceEvents: flow.ServiceEventList{*epochCommitServiceEvent},
		Events:                 meta.ChunkEvents[:3],
	}

	_, err = s.verifier.Verify(vch)

	assert.Error(s.T(), err)
	assert.True(s.T(), chunksmodels.IsChunkFaultError(err))
	assert.IsType(s.T(), &chunksmodels.CFInvalidServiceEventsEmitted{}, err)
}

// Tests that service events are checked, when they appear outside the system chunk.
// NOTE: this test case relies on the ordering of transactions in generateCollection.
func (s *ChunkVerifierTestSuite) TestServiceEventsAreChecked_NonSystemChunk() {
	script := "service event in non-system chunk"
	meta := s.GetTestSetup(s.T(), script, false, true)
	vch := meta.RefreshChunkData(s.T())

	// setup the verifier output to include the correct data for the service events
	output := generateDefaultOutput()
	output.ConvertedServiceEvents = meta.ServiceEvents
	output.Events = meta.ChunkEvents[:3] // 2 default events + 1 service event
	s.outputs[script] = output

	spockSecret, err := s.verifier.Verify(vch)
	assert.NoError(s.T(), err)
	assert.NotNil(s.T(), spockSecret)
}
```
Member:

I am struggling to convince myself that we really test the correct edge cases. In my mind, we are trying to test the following complementary aspects, with as few gaps as possible:

  1. Situation: a non-system chunk containing a service event (honest). Expected: pass. I think this is tested in TestServiceEventsAreChecked_NonSystemChunk.

  2. Exactly the same as situation 1, except that the ConvertedServiceEvents is different. Expected: chunk fault.

    I got confused here, because in test TestServiceEventsMismatch_NonSystemChunk too many lines of code are different, each of which could be a symptom of a chunk fault. For me, it would really help if TestServiceEventsMismatch_NonSystemChunk mirrored TestServiceEventsAreChecked_NonSystemChunk with as few changes as possible.

Member Author:

> For me, it would really help if TestServiceEventsMismatch_NonSystemChunk mirrored TestServiceEventsAreChecked_NonSystemChunk with as few changes as possible.

They are different because the existing testing infrastructure has very different code-paths for the system chunk and other chunks. Unfortunately I don't think it is feasible to make them more similar without a larger refactor of this test file.

module/chunks/chunkVerifier_test.go, comment on lines 305 to 310
```go
s.snapshots[script] = &snapshot.ExecutionSnapshot{}
s.outputs[script] = fvm.ProcedureOutput{
	ComputationUsed:        computationUsed,
	ConvertedServiceEvents: flow.ServiceEventList{*epochCommitServiceEvent},
	Events:                 meta.ChunkEvents[:3],
}
```
Member:

To me this seems significantly different from Specifically, I don

```go
// setup the verifier output to include the correct data for the service events
output := generateDefaultOutput()
output.ConvertedServiceEvents = meta.ServiceEvents
output.Events = meta.ChunkEvents[:3] // 2 default events + 1 service event
s.outputs[script] = output
```
and I don't understand why this needs to be. In the end, we want to confirm that the verifier catches it if only one detail is different from the honest execution.

@jordanschalm (Member Author) commented Dec 10, 2024

I think your comment got cut off. The portion of code you linked is constructing an expected output for a transaction in which a service event was emitted (outside the system chunk).

  • Line 328 is assigning the default service events for a non-system-chunk transaction as the expected output
  • Line 329 is pulling out the 3 events associated with the transaction and adding them to the expected output
  • Line 330 is inserting the expected output into the map, so the verifier will consider this the canonical output for the transaction

```go
// setup the verifier output to include the correct data for the service events
output := generateDefaultOutput()
output.ConvertedServiceEvents = meta.ServiceEvents
output.Events = meta.ChunkEvents[:3] // 2 default events + 1 service event
```
Member:

Why are we trimming meta.ChunkEvents here? The chunk events can be more than that, can't they?

The testing framework has lots of layers, which I am struggling with. Though, to the best of my limited understanding, honest chunk data should be consistent with the verifier's local output. The chunk data here is represented by meta and the derived vch. So to me, the following assignment would make sense:

```go
output.Events = meta.ChunkEvents
```

Member Author:

The output is the expected output for one transaction. The test framework adds 2 events (contents of eventsList) as the expected output for every transaction by default. If specified (new option in this PR), it will additionally add 1 service event (3 total).

module/chunks/chunkVerifier_test.go (review threads resolved)
@jordanschalm removed the request for review from durkmurder on December 9, 2024 22:01
@zhangchiqing (Member) left a comment:
Nice tests. LGTM

model/flow/chunk.go (review thread resolved)
@jordanschalm merged commit af44135 into feature/efm-recovery on Dec 11, 2024
55 checks passed
@jordanschalm deleted the jord/6622-chunk-service-events branch on December 11, 2024 20:35